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Abstract 

Orientation selectivity is the most striking feature of simple cell coding in VI which has 
been shown to emerge from the reduction of higher-order correlations in natural images in 
a large variety of statistical image models. The most parsimonious one among these mod- 
els is linear Independent Component Analysis (ICA), whereas second-order decorrelation 
transformations such as Principal Component Analysis (PCA) do not yield oriented filters. 
Because of this finding it has been suggested that the emergence of orientation selectivity 
may be explained by higher-order redundancy reduction. In order to assess the tenability 
of this hypothesis, it is an important empirical question how much more redundancies can 
be removed with ICA in comparison to PCA, or other second-order decorrelation methods. 
This question has not yet been settled, as over the last ton years contradicting results have 
been reported ranging from less than five to more than hundred percent extra gain for 
ICA. Here, we aim at resolving this conflict by presenting a vcrj^ careful and comprehen- 
sive analysis using three evaluation criteria related to redundancy reduction: In addition 
to the multi-information and the average log-loss we compute, for the first time, complete 
rate-distortion curves for ICA in comparison with PCA. Without exception, we find that 
the advantage of the ICA filters is surprisingly small. We conclude that orientation se- 
lective receptive fields in primary visual cortex cannot be explained in the framework of 
linear redundancy reduction. Furthermore, we show that a simple spherically symmetric 
distribution with only two parameters can fit the data even better than the probabilistic 
model underlying ICA. Since spherically symmetric models arc agnostic with respect to 
the specific filter shapes, we conlude that orientation selectivity is unlikely to play a critical 
role for redundancy reduction even if the class of transformations is not limited to linear 
ones. 

Author Summary 

What is the computational goal underlying the response properties of simple cells in primary 
visual cortex? Since the Nobel prize winning work of Hubel and Wiesel we have known that 
orientation selectivity is an important feature of simple cells in VI. A conclusive and quan- 
titative explanation for the actual purpose of this biological structure, however, still remains 
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elusive. Among the hypotheses on its purpose, the idea of redundancy reduction (Barlow, 1959) 
stands out in being formalized precisely in terms of a quantitative objective function. While 
it has been reported previously that orientation selectivity can reduce the redundancy in the 
representation of color images by more than 100%, we here provide strong evidence that the 
actual gain is only 3% for both color and gray value images. Thus, our results challenge the 
conception that the major goal of orientation selective coding in VI is redundancy reduction. 
Furthermore, we now have a reliable reference point for quantitative investigations of nonlinear 
early vision models. 

Introduction 

It is a long standing hypothesis that neural representations in sensory systems are adapted to 
the statistical regularities of the environment [Tj [2] . Despite widespread agreement that neural 
processing in the early visual system must be influenced by the statistics of natural images, 
there are many different viewpoints on how to precisely formulate the computational goal the 
system is trying to achieve. At the same time, different goals might be achieved by the same 
optimization criterion or learning principle. Redundancy reduction [2], the most prominent 
example of such a principle, can be beneficial in various ways: it can help to maximize the 
information to be sent through a channel of limited capacity it can be used to learn the 
statistics of the input [1] or to facilitate pattern recognition [S]. 

Besides redundancy reduction, a variety of other interesting criteria such as sparseness [S] [7] , 
temporal coherence [H], or predictive information [HI HO] have been formulated. An important 
commonality among all these ideas is the tight link to density to density estimation of the input 
signal. Redundancy reduction, in particular, can be interpreted as a special form of density 
estimation where the goal is to find a mapping which transforms the data into a representation 
with statistically independent coefficients. This idea is related to projection pursuit [TT] where 
density estimation is carried out by optimizing over a set of possible transformations in order 
to match the statistics of the transformed signal as good as possible to a pre-specified target 
distribution. Once the distribution has been matched, applying the inverse transformation 
effectively yields a density model for the original data. From a neurobiological point of view, 
we may interpret the neural response properties as an implementations of such transformations. 
Accordingly, the redundancy reduction principle can be seen as special case of projection pursuit 
where the target distribution is assumed to have independent marginal distributions. This 
renders it a very general principle. 

A crucial aspect of this kind of approach is the class of transformations over which to 
optimize. From a statistician's point of view it is important to choose a regularized function 
space in order to avoid overfitting. On the other hand, if the class of possible transformations 
is too restricted, it may be impossible to find a good match to the target distribution. From 
a visual neuroscientist's point of view, the choice of transformations should be related to the 
class of possible computations in the early visual system. 

Intriguingly, a number of response properties of visual neurons have been reproduced by 
optimizing over the class of linear transformations on natural images for redundancy reduction 
(for a review see [T5] [13]). For instance, Buchsbaum and Gottschalk revealed a link between 
the second-order statistics of color images and opponent color coding of retinal ganglion cells by 
demonstrating that decorrelating natural images in the trichromatic color space with Principal 
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Component Analysis (PCA) yields the luminance, the red-green, and the blue-yellow channel 
|14j . Atick and Redlich derived the center-surround receptive fields by optimizing a symmetric 
decorrelation transformation |15j . Later, also spatio-temporal correlations in natural images or 
sequences of natural images were linked to the receptive field properties in the retina and the 
lateral geniculate nucleus (LGN) [ro i lTT t lTS]. 

On the way from LGN to primary visual cortex, orientation selectivity emerges as a striking 
new receptive field property. A number of researchers [e.g. 1191 120] have used the covariance 
properties of natural images to derive linear basis functions that exhibit similar properties. 
Decorrelation alone, however, was not sufficient to achieve this goal. Rather, additional con- 
straints were necessary, such as spatial locality or symmetry. 

It was not until the reduction of higher-order correlations were taken into account that the 
derivation of localized and oriented band-pass filters — resembling orientation selective receptive 
fields in VI — was achieved without the necessity to assume any further constraints. Those 
filters were derived with Independent Component Analysis (ICA) , a generalization of Principal 
Component Analysis (PCA), which in addition seeks to reduce all higher-order correlations 

mm- 

This finding suggests that the orientation selectivity of VI simple cells may be explained by 
the minimization of higher-order correlations. But in order to test this hypothesis, one must 
also demonstrate that those filters can yield a substantial advantage for redundancy reduction. 
If this is not the case, the objective of redundancy reduction does not seem to be critical for 
the emergence of orientation selectivity in primary visual cortex. 

The importance of such a quantitative assessment has already been pointed out in earlier 
studies [12 US H3 HSl US] • In particular, a direct quantification of the redundancy reduction 
achieved with ICA and PCA for natural images has been carried out in [22l ESI [Mj |26] but 
they differ considerably in their conclusions regarding the effect of higher-order correlations in 
linear image models ranging from less than five to more than hundred percent. 

This is a very unsatisfying situation as even for the simplest class of models — the linear 
case — the literature does not provide a clear answer to the question of how important ori- 
entation selectivity may be for the reduction of redundancy in natural images. In order to 
resolve this issue, we carried out this comprehensive study which exclusively investigates lin- 
ear image models that have been around for more than a decade by now. Our goal is to 
establish a reliable reference against which more sophisticated image models can be compared 
to in the future. We elaborate on our own previous work |26j by using an optimized ICA 
algorithm on the color image data set for which the largest higher-order redundancy reduc- 
tion has been reported previously. The advantage of the resulting orientation selective ICA 
filters is tested comprehensively with three different types of analysis that are related to the 
notion of redundancy reduction and coding efficiency: (A) multi-information reduction, (B) 
average log-likelihood, and (C) rate-distortion curves. Since we put strong emphasis on the 
reproducibility and verifiabiUty of our results, we provide our code and the dataset online 
(|http : //www . kyb . tuebingen . mpg . de/bethge/ code/QICA/ ) . 

Our results show that orientation selective ICA filters do not excel in any of these measures: 
We find that the gain of ICA in redundancy reduction over a random decorrelation method is 
only about 3% for color and gray- value images. In terms of rate-distortion curves, ICA performs 
even worse than PCA. Furthermore, we demonstrate that a simple spherically symmetric model 
with only two parameters fits the data significantly better. Since in this model the specific shape 
of the filters is ignored, we conclude that it is unlikely that orientation selectivity plays a critical 
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role for redundancy reduction even if the class of transformations is not limited to linear ones. 

Material and Methods 

An important difRculty in setting up a quantitative comparison originates from the fact that it 
bears several issues that may be critical for the results. In particular, choices have to be made 
regarding the evaluation criteria, the image data, the estimation methods, regarding which 
linear transformations besides ICA to be inchided in the comparison, and which particular 
implementation of ICA to be used. The significance of the outcome of the comparison will 
therefore depend on how careful these choices have been made and the most relevant issues will 
be addressed in the following. 

Notation and Nomenclature 

For both, color and gray-value data, we write x to refer to single vectors which contain the 
raw pixel intensities. Vectors are indicated by bold font while the same letter in normal font 
with a subindex denotes one of its components. Vectors without subindices usually denote 
random variables, while subindices indicate specific examples. In some cases it is convenient to 
define the corresponding data matrix X = (xi, . . . ,xjv) which holds single images patches in 
its columns. The letter TV denotes the number of examples in the dataset, while n is used for 
the dimension of a single data point. 

Transformations are denoted by W, oftentimes with a subindex to distinguish different 
types. The result of a transformation to either a vector x or a data matrix X will be written 
as y = Wx or F = WX, respectively. 

Probability densities are denoted with the letters p and q, sometimes with a subindex to 
indicate differences between distributions whenever it seems necessary for clarity. In general, 
we use the hat symbol to distinguish between true entities and their empirical estimates. For 
instance, Py{y) = P:x.{W~^y) ■ \ det W\~^ is the true probability density of y after applying a 
fixed transformation W, while Py(y) refers to the corresponding empirical estimate. 

A distribution p(y) is called factorial, or marginally independent, if it can be written as a 
product of its marginals, i.e. p{y) = Yli^iPiiVi) where Pi{yi) is obtained by integrating p{y) 
over all components but y^. 

Finally, the expectation over some entity / with respect to y is written as E [/(y)] 

/ P(y)/(y)'^y- Sometimes, we use the density instead of the random variable in the subindex to 
indicate the distribution, over which the expectation is taken. If there is no risk for confusion 
we drop the subindex. Just as above, the empirical expectation is marked with a hat symbol, 

i.e. E[/(y)] = iEr=i/(y;c). 

How to compcire early vision models? 

A principal complicacy in low-level vision is the lack of a clearly defined task. Therefore, it is 
difliicult to compare different image representations as it is not obvious a priori what measure 
should be used. 
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(A) Multi-Information The first measure we consider is the multi-information [57], which 
is the true objective function that is minimized by ICA over the choice of filters W. The multi- 
information assesses the total amount of statistical dependencies between the components yi of 
a filtered patch y = Wx: 
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The terms h[pj{yj)] and h[p(y)] denote the marginal and the joint entropies of the true distri- 
bution, respectively. The KuUback-Leibler- Divergence or Relative Entropy 
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is an information theoretic dissimilarity measure between two distributions p and q [28 . It is 
always non-negative and zero if and only if p equals q. If the redundancy reduction hypothesis 
is taken literally, the multi-information is the right measure to minimize, since it measures how 
close to factorial the true distribution of the image patches in the representation y really is. 

The application of linear ICA algorithms to ensembles of natural images very reliably yields 
transformations consisting of localized and oriented bandpass filters similar to the receptive 
fields of neurons in VI. It is less clear, however, whether these filter properties also critical 
to the minimization of the multi-information, i.e. the redundancy? In order to assess the 
tenabihty of the idea that a VI simple cell is adjusted to the purpose of redundancy reduction, 
it is important to know whether such a tuning can — in principle — result in a large reduction 
of the multi-information. One way to address this question is to measure how much more the 
multi-information is actually reduced by the ICA filters in comparison to others such as PCA 
filters. This approach has been taken in [25] . 

One problem with estimating multi-information is that it involves the joint entropy h[p(y)] 
of the true distribution which is generally very hard to estimate. In certain cases, however, 
the problem can be bypassed by evaluating the difference in the multi-information between two 
representations x and y. In particular, if y is related to x by the linear transformation y — Wx 
it follows from definition ([T|) and the transformation theorem for probability densities 



Py(y) =Px(x) 
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dy^ 
dx 



= Px(M^"V)-|detI^|- 



that difference in multi-information can be expressed as 



I[p{y)] - /b(x)] - ^[PkiVk)] - h[p{y)] ~ h[Pk{xk)] ~ h[p{x)] 

k \ k / 

= Y ^IPkiVk)] - Y ^\Pki^k)] - log |det W\ . 



For convenience, we chose a volume-conserving gauge (26] where all linear decorrelation trans- 
forms are of determinant one, and hence log |det VF| = 0. This means that differences in multi- 
information are equal to differences of marginal entropies which can be estimated robustly. 
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Thus, our empirical estimates of the multi-information differences are given by: 

A/«^/i[pfc(2/fc)]-^/i[pfc(xfc)] si. |det(W^)| = 1 (2) 

k k 

For estimating the entropy of the univariate marginal distributions, we employ the OPT estima- 
tor introduced in [26] which uses the exponential power family to fit the marginal distributions 
by searching for the OPTimal exponent. This estimator has been shown to give highly reliable 
results for natural images. In particular, it is much more robust than entropy estimators based 
on the sample kurtosis which easily overestimate the multi-information. 

(B) Average Log Loss (ALL) As mentioned earlier, redundancy reduction can be inter- 
preted as a special form of density estimation where the goal is to find a mapping which trans- 
forms the data into a representation with statistically independent coefficients. This means that 
any given transformation specifies a density model over the data. Our second measure, the av- 
erage log- loss (ALL), evaluates the agreement of this density model with the actual distribution 
of the data: 

E[-logp(y)] = - / p(y)logp(y)dy = H[p] + Dki.[v\\p\ (3) 

The average log-loss is a principled measure quantifying how different the model density p{y) 
is from the true density p(y) [23]. Since the KL-divergence is positive and zero if and only if 
p = p the ALL is minimal only if p matches the true density. Furthermore, differences in the 
average log-loss correspond to differences in the coding cost (i.e. information rate) in the case 
of sufficiently fine quantization. For natural images, different image representations have been 
compared with respect to this measure in [22^ '23^ ^24] . 

For the estimation of the average log-loss, we follow Lewicki et al. [2H 132] (We are referring 
here to the first method in the cited references. The defining equation there contains an extra 
term N log a which indicates the role of noise. Note, however, that this term has only a symbolic 
meaning as the noise is assumed to be independent of the representation so that this term can 
in fact be ignored) by simply using the empirical average 

1 ^ 

Ehlogp(y)] « E[-logp(y)] = -- Vlogp(yfc). (4) 
p y " 

k—l 

While this estimator in principle can be prone to overfitting, we control for this risk by evaluating 
all estimates on an independent test set, whose data has not been used during the parameter 
fit. Furthermore, we compare the average log-loss to the parametric entropy estimates h[p\ that 
we use in (A) for estimating the multi-information changes (see Eq. [2]) . The difference between 
both quantities has been named differential log-likelihood |31j and can be used to assess the 
goodness of fit of a model distribution: 

E[-logp] -h[p] = E[logp] -E[logp]. 
p 

The shape of the parametric model is well matched to the actual distribution if the differ- 
ential log-likelihood converges to zero with increasing number of data points. 
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(C) Rate-Distortion Curves Finally, we consider efficient coding ot minimum mean square 
error reconstruction as a third objective. In contrast to the previous objectives, it is now 
assumed that there is a bottleneck, limiting the amount of information that can be transmitted, 
and the goal is to maximize the amount of relevant information transmitted about the image. 
In the context of neural coding, the redundancy reduction hypothesis has oftentimes been 
motivated in terms of coding efficiency. In fact, instead of minimizing the multi-information one 
can equivalently ask for the linear transformation W which maximizes the mutual information 
between its input x and its output + ^ when additive noise ^ is added to the output [32] [33] . 
It is important to note, however, that this minimalist approach of "information maximization" 
is ignorant with respect to how useful or relevant the information is that has been transmitted 

For natural images, the source signal x is a continuous random variable which requires in- 
finitely many bits to be specified with unlimited precision. If there is an information bottleneck, 
however, only a limited amount of bits can be transmitted. Both, the multi-information and 
the average log-loss do not take into account the problem what information should be encoded 
and what information can be discarded. Therefore, a representation optimized only for redun- 
dancy reduction can in fact perform quite poorly with respect to the goal of preserving the 
relevant information during the transmission |26l 1341 I35j . In order to compare two different 
representations with respect to this more complete notion of coding efficiency, it is necessary to 
have a measure for the quality with which a signal can be reconstructed from the information 
that is preserved by the representation. This objective is in fact very much related to the task 
of image compression. 

Clearly, we expect that the criteria for judging image compression algorithms may not 
provide a good proxy to an accurate judgement of what information is considered relevant in a 
biological vision system. In particular, the existence of selective attention suggests that different 
aspects of image information are transmitted at different times depending on the behavioral 
goals and circumstances |13| . That is, a biological organism can change the relevance criteria 
dynamically on demand while for still image compression algorithms it is rather necessary that 
this assessment is made once and forever in a fixed and static fashion. 

The issue of selective attention is outside the scope of this paper. Striving for simplicity 
instead, we will use the mean squared reconstruction error for the pixel intensities. This is 
the measure of choice for high-rate still image compression |36j . In particular, it is common to 
report on the performance of a code by determining its rate-distortion curve which specifies the 
required information rate for a given reconstruction error (and vice versa) |35l . Consequently, 
we will ask for a given information rate, how do the image representations compare with respect 
to the reconstruction error. As result, we will obtain a so-called rate-distortion curve which 
displays the average reconstruction error as a function of the information rate or vice versa. 
The second method used in |22l 130] is an estimate of a single point on this curve for a particular 
fixed value of the reconstruction error. 

The estimation of the rate-distortion curve is clearly the most difficult task among the 
three criteria. We adopt the framework of transform coding |34] extensively used in still image 
compression, which divides the encoding task into two steps: First, the image patches x are 
linearly transformed into y = Wn. Then the coefficients yj are quantized independently of 
each other. Using this framework, we can ask whether the use of an ICA image transformation 
leads to a smaller reconstruction error after coefficient quantization than PCA or any other 
transform. 
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As for quantizing the coefficients, we resort to the framework of variable rate entropy coding 
|37| . In particular, we apply uniform quantization, which is close to optimal for high-rate 
compression |38l I34j . For uniform quantization, it is only required to specify the bin width of 
the coefficients. There is also the possibility to use a different number of quantization levels 
for the different coefficients. The question of how to set these numbers is known as the 'bit 
allocation problem' because the amount of bits needed to encode one coefficient will depend 
monotonically on the number of quantization levels. The number of quantization levels can be 
adjusted in two different but equivalent ways: One possibility is to use a different bin width for 
each individual coefficient. Alternatively, it is also possible to use the same bin width for all 
coefficients and multiply all coefficients with an appropriate scale factor before quantization. 
The larger the variance of an individual coefficient, the more bits will be allocated to represent 
it. 

Here, we will employ the latter approach, for which the bit allocation problem becomes 
an inherent part of the transformation: Any bit allocation scheme can be obtained via post- 
multiplication with a diagonal matrix. Thus, in contrast to the objective function of ICA, the 
rate-distortion criterion is not invariant against post-multiplication with a diagonal matrix. 
For ICA and PCA, we will determine the rate-distortion curve for both, normalized output 
variances ("white ICA" and "white PCA") and normalized basis functions ("normalized ICA" 
and "orthonormal PCA" ) , respectively. 

Decorrelation transforms 

The particular shape of the ICA basis functions is obtained by minimization of the multi- 
information over all invertible linear transforms y ~ VFx. In contrast, the removal of second- 
order correlations alone generally does not yield localized, oriented, and bandpass image basis 
functions. ICA additionally removes higher-order correlations which are generated by linear 
mixing. In order to assess the importance of this type of higher-order correlations for redun- 
dancy reduction and coding efficiency we will compare ICA to other decorrelating image bases. 
Let C = E [xx^] be the covariance matrix of the data and C ~ U DU^ its eigen-decomposition. 

Then, any linear second-order decorrelation transform can be written as 

W = D2-V- D-^/^ ■ (5) 

where D and U are defined as above, V is an arbitrary orthogonal matrix and D2 is an arbitrary 
diagonal matrix. It is easily verified that Y = WX has diagonal covariance for all choices of V 
and D2, i.e. all second-order correlations vanish. This means that any particular choice of V 
and D2 determines a specific decorrelation transform. Based on this observation we introduce 
a number of linear transformations for later reference. All matrices are square and arc chosen 
to be of determinant A™, where m is the number of columns (or rows) of W (i.e. A = "x/TT^ 
is the geometrical mean of the eigenvalues Xi,i — 1, . . . , m). 

Orthogonal principal component analysis (oPCA) 

If the variances of the principle components (i.e. the diagonal elements of D) are all 
different, PCA is the only metric-preserving decorrelation transform and is heavily used 
in digital image coding. It corresponds to choosing V = I„i as the identity matrix and 
D2 = XD^/^, such that VKoPCA = AC/^. 
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White principal component analysis (wPCA) 

Equalizing tlie output variances in the PCA representation sets the stage for the derivation 
of further decorrelation transforms different from PCA. In order to assess the effect of 
equalization for coding efficiency, we also include this "white PCA" representation into 
our analysis: Choose F = as for orthonormal PCA and then set D2 = A^-^m with 
Vdet(L>i/2) such that I^wpca = M D-^/'^U'^ . 

Symmetric whitening (SYM) 

Among the non-orthogonal decorrelation transforms, symmetric whitening stays as close 
to the input representation as possible (in Frobenius norm) [39' . In terms of early vision 
this may be seen as an implementation of a wiring length minimization principle. Re- 
markably, the basis functions of symmetric whitening resemble the center-surround shape 
of retinal ganglion cell receptive fields when applied to the pixel representation of natu- 
ral images |15j . The symmetric whitening transform is obtained by setting V = U and 
D2 = film such that M^sYM = fJ- UD-'^/'^U^ . 

Random whitening (RND) 

As a baseline which neither exploits a special structure with respect to the input represen- 
tation nor makes use of higher-order correlations we also consider a completely random 
transformation. To obtain a random orthogonal matrix we first draw a random matrix G 
from a Gaussian matrix- variate distribution and then we set Frnd = {GG^)-^I'^G. With 
D2 = ulrn we obtain W-rnu = M Vrnd-D^^/^J/^. 

White independent component analysis (wICA) 

Finally, ICA is the transformation which has been suggested to explain the orientation 
selectivity of VI simple cells [21] • Set V — Vic a for which the multi-information I[Y] 
takes a minimum. With D2 = fJ,Im we obtain W^icA = M VicaD~^^^U^ . 

Normalized independent component analysis (nICA) 

Normalized independent component analysis (nICA) differs from white ICA (WwICa) only 
by a different choice of the second diagonal matrix D2 ■ Instead of having equal variance 
in each coefficient, we now choose D2 such that the corresponding basis vector of each 
coefficient has the same length in pixel space. It is easy to see that our first two criteria, 
the multi- information and the negative log- likelihood, are invariant under changes in D2. 
It makes a difference for the rate-distortion curves as in our setup the variance (or, 
more precisely, the standard deviation) determines the bit allocation. Practically, W^icA 
can be determined by using W^wiCA as follows: First, we compute the matrix inverse 
A := and determine the Euclidean norm ai, . . . ,am of the column vectors of A. 

With Da = diag(ai, . . . , a^), we then obtain WnicA = ^/^l,^ ^ -PaWwicA- 

ICA algorithm 

If the true joint probability distribution is known, the minimization of the multi-information 
over all linear transformations can be formulated without any assumptions about the shape of 
the distribution. In practice, the multi-information has to be estimated from a finite amount 
of data which requires to make assumptions about the underlying density. 
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There are many different ICA algorithms which differ in the assumptions made and also in 
the optimization technique employed. The choice of the particular ICA algorithm used here was 
guided by a set of requirements that arise from the specific problem setting. Although a wide 
variety of ICA algorithms has been published, none of them fits exactly all of our requirements. 

We would like to use an ICA algorithm, which gives the ICA image basis the best chance for 
the comparison with other image representations. For the comparison of the multi-information 
reduction, we are using the OPT estimator introduced in |2S] which has been found to give 
the most reliable results. This estimator employs a parametric estimate of the coefficient dis- 
tributions based on the exponential power family which is known to provide an excellent fit to 
the coefficient distributions of natural images [101 US]- Our ICA algorithm should make the 
same assumptions about the data as we make for the final comparison of the multi-information 
reduction. Therefore, we are also using the exponential power family model for the marginal 
densities during the minimization of the multi-information. In addition, we want to have an 
ICA basis which is indistinguishable from the other image representations with respect to the 
second-order statistics. Therefore, we are using a pre-whitened ICA algorithm, whose search 
space is restricted to the subgroup of orthogonal matrices SO{n). One of the most efficient 
ICA methods in the public domain specialized to pre-whitened ICA is FastICA [41j. We use 
this fixed-point algorithm as an initialization. Subsequently, the solution is further refined by 
performing a gradient ascent over the manifold of orthogonal matrices on the likelihood of the 
data, when each marginal is modelled by a the exponential power distribution as in the case of 
the OPT estimator. 

In order to optimize the objective function over the subspace of orthogonal matrices, we 
adapted the algorithms for Stiefel manifolds proposed by Edelman et al. [42^ to the simpler case 
of orthogonal groups and combined it with the line-search routine dbrent from |43j to achieve 
a rather straightforward gradient descent algorithm. For the initialization with FastICA, we 
use the Gaussian non-linearity, the symmetric approach and a tolerance level of 10~^. 

Spherically symmetric model 

A well known result by Maxwell [H] states that the only factorial distribution invariant against 
arbitrary orthogonal transformations is the isotropic Gaussian distribution. Natural images 
exhibit marginals which are significantly more peaked than Gaussian. Nevertheless, their dis- 
tribution docs share the spherical symmetry with the Gaussian as already found by |45| for 
gabor filter pairs and lately exploited by [55^ for nonlinear image representations. Therefore, it 
makes sense to compare the performance of the ICA model with a spherically symmetric model 
of the whitened data yw = W'rndx. Note that any spherically symmetric model is still invari- 
ant under orthogonal transformations while only the Gaussian additionally exhibits marginal 
independence. 

While the radial distribution of a Gaussian (i.e. the distribution over the lengths of the 
random vectors) is a x distribution, whose shape and scale parameter is determined by the 
number of dimensions and the variance, respectively, the spherical symmetric model may be 
seen as a generalization of the Gaussian, for which the radial distribution p{r) with r := ||y||2 
can be of arbitrary shape. The density of the spherically symmetric distribution (SSD) is defined 
as Py(y) — Pr{r) / Sn{r), where S'„(r) = r"~"'^27r"/^/r(n/2) is the surface area of a sphere in 
M" with radius r. For simplicity we will model the radial distribution with a member of the 
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Gamma family 

Pir)-"—^^ ,r>0 (6) 
s"l (u) 

with shape parameter u and scale parameter s, which can be easily matched to the mean and 
variance of the empirical distribution via s — Var[r]/E [r] and u = E [r] /Var[r]. 



Dataset 

The difference in the performance between ICA and other linear transformations clearly depends 
on the data. For gray-scale images we observed in our previous study [26 that the difference 
in the multi-information between ICA and any other decorrelation transform is consistently 
smaller than 5%. In particular, we controlled for the use of different pictures and for the effect 
of different pre-processing steps. 

Here, we resort to the dataset used in a previous study [53] [23]) which among all previous 
studies reported the largest advantage of ICA compared to PCA. This color image dataset is 
based on the Bristol Hyperspectral Images Database [37] that contains multi-spectral recordings 
of natural scenes taken in the surroundings of Bristol, UK and in the greenhouses of Bristol 
Botanical Gardens. The authors of f24] kindly provided to us a pre-processed version of the 
image data where spectral radiance vectors were already converted into LMS values. During 
subsequent processing the reflectance standard was cut out and images were converted to log 
intensities (cf. [M|). 

All images come at a resolution of 256 x 256 pixels. From each image circa 5000 patches of 
size 7x7 pixels were drawn at random locations (circa 40000 patches in total) . For chromatic 
images with three color channels (LMS) each patch is reshaped asa7x7x3 = 147-dimensional 
vector. To estimate the contribution of color information, a comparison with monochromatic 
images was performed where gray- value intensities were computed as / = log(|(L + M + S)) 
and exactly the same patches were used for analysis. In the latter case, the dimensionality of 
a data sample is thus reduced to 49 dimensions. Our motivation to chose 7x7 patches is to 
ensure maximal comparability to the study of |24j . We carried out the same analysis for patch 
sizes of 15 X 15 as well. For the sake of clarity, only the number for the 7x7 patches are reported 
in this paper. The results for 15 x 15 can be found in the additional material. All experiments 
are carried out over ten different training and test sets sampled independently from the original 
images. 

Since the statistics of the constant illumation part of image patches, i.e. the DC component, 
differs significantly from the statistics of luminance variations, we removed the DC component 
from the patches before further transforming them. In order to not affect the entropy of the 
data, we used an orthogonal transformation. The projector PremDC is computed such that the 
first (for each color channel) component of PremDC^ corresponds to the DC component(s) of 
that patch. One such a possible choice is the matrix 

/I •••\^ 
1 1 ••• 
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However, this is not an orthogonal transformation. Therefore, we decompose P into P = QR 
where R is upper triangular and Q is an orthogonal transform. Since P — QR, the first column 
of Q must be a multiple of the vector with all coefficients equal to one (due to the upper 
triangluarity of R). Therefore, the first component of Q^x is a multiple of the DC component. 
Since Q is an orthonomal transform, using all but the first row of for PremDC projects out 
the DC component. In the case of color images PremDC becomes a block-diagonal matrix with 
as diagonal elements for each channel. 

By removing the DC component in that manner, all linear transformations are applied in 
n — 1 dimensions, if n denotes the number of pixels in the original image patch. In this case the 
marginal entropy of the DC-components has to be included in the computation of the multi- 
information in order to ensure a valid comparison with the original pixel basis. We use the 
same estimators as in to estimate the marginal entropy of DC-component. 

Results 

Filter Shapes 

As in previous studies [T] [2T] the filters derived with ICA exhibited orientation selective tuning 
properties similar to those observed for VI simple cells (see Figure fll. For illustration, we 
also show the basis functions learned with PCA and RND in Figure^ The basis functions 
A are obtained by inverting the filter matrix W (including the DC component). The result is 
displayed in the upper panel (Figure [i]A-C). Following common practice, we also visualize the 
basis functions after symmetric whitening (Figure [T] D-F). 
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Figure 1: Examples for Receptive Fields of Various Image Transforms Basis func- 
tions of a random decorrelation transform (RISID), principal component analysis (PCA) and 
independent component analysis (ICA) in pixel space (A-C) and whitened space (E-F). The im- 
age representation in whitened space is obtained by left multiplication with the matrix square 
root of the inverse covariance matrix C~^/^. This figure can only give a rough idea of the shape 
of the basis functions. For a detailed inspection of the basis functions we refer the reader to 
our web page http : / / www . kyb . mp g . de/bethge/code/QICA/ where we provide all the data and 
code used in this paper. 



The basis functions of both PCA and ICA exhibit color opponent coding but the basis 
functions of ICA are additionally localized and orientation selective. The basis functions of 
the random decorrelation transform does not exhibit any regular structure besides the fact that 
they are bandpass. The following quantitative comparisons will show, however, that the distinct 
shape of the ICA basis functions does not yield a clear advantage for redundancy reduction and 
coding efficiency. 



Multi-information 

The multi-information is the true objective function that is minimized by ICA over all possible 
linear decorrelation transforms. Figure [2] shows the reduction in multi-information achieved 
with different decorrelation transforms including ICA for chromatic and gray value images, 
respectively. For each representation, the results are reported in bits per component, i.e. as 
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marginal entropies averaged over all dimensions: 

n 

{h) = -J2h[pM] (7) 



n 

k=l 



Decorrelation transforms on chromatic images g Decorrelation transforms on gray images 




Figure 2: Multi-Information Reduction per Dimension Average differential entropy (h) 
for the pixel basis (PIX), after separation of the DC component (DCS), and after application 
of the different decorrelation transforms. The difference between PIX and RND corresponds to 
the redundancy reduction that is achieved with a random second-order decorrelation transform. 
The small difference between RND and ICA is the maximal amount of higher-order redundancy 
reduction that can be achieved by ICA. Diagram (A) shows the results for chromatic images 
and diagram (B) for gray value images. For both types of images, only a marginal amount can 
be accounted to the reduction of higher order dependencies. 

Table [l] shows the corresponding values for the transformations RND, SYM, PCA and 
ICA. For both chromatic images and gray-value intensities, the lowest and highest reduction is 
achieved with RND or ICA, respectively. However, the additional gain in the multi-information 
reduction achieved with ICA on top of RND constitutes only 3.20% for chromatic images and 
2.39% for achromatic in comparison with the total reduction relative to the pixel basis (PIX). 
This means that only a small fraction of redundancy reduction can actually be accounted to 
the removal of higher-order redundancies with ICA. 
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Absolute Difference 


Relative Difference 




Color 


Gray 




Color 


Gray 


RND-PIX 


-4.0694 ± 0.0043 


-3.1252 ± 0.0043 








SYM-RND 


-0.0593 ± 0.0004 


-0.0259 ± 0.0006 


SYM-RND 
SYM-PIX 


1.44 ± 0.01 


0.82 ± 0.02 


PCA-RND 


-0.0627 ± 0.0008 


-0.0353 ± 0.0011 


PCA-RND 
PCA-PIX 


1.52 ± 0.02 


1.12 ± 0.03 


ICA-RND 


-0.1345 ± 0.0008 


-0.0767 ± 0.0008 


ICA-RND 
ICA-PIX 


3.20 ± 0.02 


2.39 ± 0.02 



Table 1: Comparision of the Multi-Information Reduction for Chromatic and Achro- 
matic Images Differences in the multi-information reduction between various decorrelation 
transforms (SYM, PCA, ICA) relative to a random decorrelation transform (RND) compared 
to the multi-information reduction achieved with the random decorrelation transform relative 
to the original pixel basis (RND-PIX). The absolute multi-information reduction is given in 
bits/component on the left hand side. How much more the special decorrelation transforms 
SYM, PCA and ICA can reduce the multi-information relative to the random (RND) one is 
given in percent on the right hand side. 

One may argue that the relatively small patch size of 7 x 7 pixel may be responsible for the 
small advantage of ICA as all decorrelation functions already getting the benefit of localization. 
In order to address the question how the patch size affects the linear redundancy reduction, 
we repeated the same analysis on a whole range of different patch sizes. Figure [3] shows the 
multi-information reduction with respect to the pixel representation (PIX) achieved by the 
transformations RND and ICA. The achievable reduction quickly saturates with increasing 
patch size such that its value for 7 x 7 image patches is already at about 90% of its asymptote. 
In particular, one can see that the relative advantage of ICA over other transformations is still 
small {r-^ 3%) also for large patch sizes. All Tables and Figures for patch size 15 x 15 can be 
found in the additional material. 
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Figure 3: Patch Size vs. Redundancy Reduction Redundancy reduction as a function 
of the patch size. The graph shows the muhi-information reduction with respect to the pixel 
representation (PIX) achieved by the transformations RND and ICA. The achievable reduction 
quickly saturates with increasing patch size such that its value for 7 x 7 image patches is 
already at about 90% of its asymptote. In particular, the small advantage of ICA over other 
transformation does not only occur for small patch sizes for which spatial localization is imposed 
on the basis functions for all transformations due to the patch boundaries. 

Average log-loss 

Since redundancy reduction can also be interpreted as a special form of density estimation we 
also look at the average log-loss which quantifies how well the underlying density model of the 
different transformations is matched to the statistics of the data. Table [2] shows the average 
log-loss (ALL) and the differential log-likelihood (DLL) in bits per component. For the average 
log-loss, ICA achieved an ALL of 1.78 bits per component for chromatic images and 1.85 bits 
per component for achromatic images. Compared to the ALL in the RND representation of 
1.94 bits and 1.94 bits, respectively, the gain achieved by ICA is again small. Additionally, 
the ALL values were very close to the differential entropies, resulting in small DLL values. 
This confirms that the exponential power distribution fits the shape of the individual marginal 
coefficient distributions well. Therefore, we can safely conclude that the advantage of ICA is 
small not only in terms of redundancy reduction as measured by the multi-information, but 
also in the sense of density estimation. 
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Color 


Gray 


A 


ALL 


ALL 


RND 


1.9486 ± 0.0035 


1.9414 ± 0.0044 


SYM-RND 


-0.0881 ± 0.0004 


-0.0402 ± 0.0005 


PCA-RND 


-0.0751 ± 0.0009 


-0.0391 ± 0.0011 


ICA-RND 


-0.1637 ± 0.0007 


-0.0880 ± 0.0007 


SSD-RND 


-0.2761 ± 0.0025 


-0.2868 ± 0.0032 


B 


DLL 


(a) 


DLL 


(a) 


RND 


-0.0113 ± 0.0007 


1.0413 ± 0.0026 


-0.0057 ± 0.0006 


1.0132 ± 0.0046 


SYM 


-0.0388 ± 0.0009 


0.8961 ± 0.0021 


-0.0195 ± 0.0009 


0.9486 ± 0.0040 


PCA 


-0.0224 ± 0.0007 


0.9145 ± 0.0024 


-0.0087 ± 0.0007 


0.9425 ± 0.0025 


ICA 


-0.0378 ± 0.0009 


0.7687 ± 0.0017 


-0.0154 ± 0.0011 


0.8434 ± 0.0025 



Table 2: Comparision of the Average Log-Loss (ALL) and the Differential Log- 
Likelihood (DLL) Chromatic and Achromatic Images A. The first row shows the 
average log-loss (ALL, in bits/component) of the density model determined by the linear trans- 
formation RND. The value was obtained by averaging over 10 separately sampled training and 
test sets of size 40.000 and 50.000, respectively. The following rows shows the difference of 
the ALL of the models SYM, PCA, ICA and the spherically symmetric density (SSD) to the 
ALL determined by linear transformation RND. The large value for RND— ICA fundamentally 
contradicts the assumptions underlying the ICA model. B. The small DLL values suggest, that 
the exponential power distribution fits the shape of the individual coefficient distributions well. 
In addition, we also report the average exponent (a) of the exponential power family fit to the 
individual coefficient distributions (a = 1 corresponds to a Laplacian shape). 

Comparison to a spherical symmetric model The fact that ICA fits the distribution 
of natural images only marginally better than a random decorrelation transform implies that 
the generative model underlying ICA does not apply to natural images. In order to assess 
the importance of the actual filter shape, we fitted a spherically symmetric model to the filter 
responses. The likelihood of such a model is invariant under post-multiplication of an orthogonal 
matrix, i.e. the actual shape of the filter. Therefore, a good fit of such a model provides strong 
evidence against a critical role of certain filter shapes. 

As shown in Table |2] the ALL of the SSD model is 1.67 bits per component for chromatic 
images and 1.65 bits per component for achromatic images. This is significantly smaller than 
the ALL of ICA indicating that it fits the distribution of natural images much better than ICA 
does. This result is particularly striking if one compares the number of parameters fitted in the 
ICA model compared to the SSD case: After whitening, the optimization in ICA is done over 
the manifold of orthogonal matrices which has m^m — 1) /2 free parameters (where m denotes 
the number of dimensions without the DC components). The additional optimization of the 
shape parameters for the exponential power family fitted to each individual component adds 
another m parameters. For the case of 7 x 7 color image patches we thus have = 10440 

parameters. In stark contrast, there are only two free parameters in the SSD model with a 
radial Gamma distribution, the shape parameter u and the scale parameter s. Nevertheless, 
for chromatic images the gain of the SSD model relative to random whitening is almost twice 
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as large as that of ICA and even three and a half times as large for achromatic images. 

Since the SSD model is completely independent of the choice of the orthogonal transfor- 
mation after whitening, its superior performance compared with ICA provides a very strong 
argument against the hypothesis that orientation selectivity plays a critical role for redundancy 
reduction. In addition, it is also corroborates earlier arguments that has been given to show 
that the statistics of natural images does not conform to the generative model underlying ICA 

ggm]. 

Besides the better fit of the data by the SSD model, there is also a more direct way of 
demonstrating the dependencies of the ICA coefficients: If FwiCA = (yi, ■ ■ ■ tYn) is data in the 
wICA representation, then the independence assumption of ICA can be simulated by applying 
independent random permutations to the rows of l^icA- Certainly, such a shuffiing procedure 
does not alter the histograms of the individual coefficients but it is suited to destroy potential 
statistical dependencies among the coefficients. Subsequently, we can transform the shuffled 
data FsicA back to the RND basis ysRND = ICA- If the ICA coefficients were 

independent, the shuffiing procedure would not alter the joint statistics, and hence, one should 
find no difference in the multi-information between I^rnd and Yrnd- But infact, we observe a 
large discrepancy between the two (Figure |4]) . The distributions of the sRND coefficients were 
very close to Gaussians and the average marginal entropy of sRND yielded (/isRND — hcauss) ~ 
—0.001 bits in contrast to (/irnd — /iGauss) ~ —0.1 bits. In other words, the finding that for 
natural images the marginals of a random decorrelation transform have Laplacian shape (a ~ 1) 
stands in clear contradiction to the generative model underlying ICA. If the ICA model was 
valid, one would expect that the sum over the ICA coefficients would yield Gaussian marginals 
due to the central limit theorem. In conclusion, we have very strong evidence that the ICA 
coefficients are not independent in case of natural images. 
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Figure 4: The Distribution of Natural Images does not Conform with the Gener- 
ative Model of ICA In order to test for statistical dependencies among the coefficients 
YwiCA of whithened ICA for single data samples, the coefficients were shuffled among the data 
points along each dimension. Subsequently, we transform the resulting data matrix Kjica into 
i^RND = WrndW~j(jj^Ysica- This corresponds to a change of basis from the ICA to the 
random decorrelation basis (RND). The plot shows the log-histogram over the coefficients over 
all dimensions. If the assumptions underlying ICA were correct, there would be no difference 
between the histogram of ygRND and Irnd • 

Rate-distortion curves 

There are different ways to account for the information bottleneck that is imposed by neural 
noise and firing rate limitations. The advantage with respect to a plain information maximiza- 
tion criterion can equivalently be measured by the multi-information criterion considered above. 
In order to additionally account for the question which representation encodes most efficiently 
the relevant image information, we also present rate distortion curves which show the minimal 
reconstruction error as a function of the information rate. 

We compared the rate-distortion curves of wICA, nICA, wPCA and oPCA (see Figure [5|. 
Despite the fact that ICA is optimal in terms of redundancy reduction (see Table [2| , oPCA 
performs optimal with respect to coding efficiency. wPCA in turn performes worst and remark- 
ably similar to wICA. Since wPCA and wICA differ only by an orthogonal transformation, 
both representations are bound to the same metric. oPCA is the only transformation which 
has the same metric as the pixel representation according to which the reconstruction error is 
determined. By normalizing the length of the ICA basis vectors in the pixel space, the metric 
of nICA becomes more similar to the pixel basis and the performance with respect to cod- 
ing efficiency improved considerably. Nevertheless, for a fixed reconstrucion error the discrete 
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entropy after quantization in the oPCA basis is up to 1 bit/component smaller than for the 
corresponding nICA-basis. 



Reconstruction-error vs. Discrete Entropy 

10| ' ' : ' ^= 




Figure 5: Rate-distortion Curves Rate-distortion curve for PCA and ICA when equahz- 
ing the output variances (wPCA and wICA) and when equahzing the norm of the corresponding 
image bases in pixel space (oPCA and nICA). The plot shows the discrete entropy Hg in bits 
(averaged over all dimensions) against the log of the squared reconstruction error a^. oPCA 
outperforms all other transforms in terms of coding efficiency. wPCA in turn performes worst 
and remarkably similar to wICA. Since wPCA and wICA differ only by an orthogonal transfor- 
mation, both representations are bound to the same metric. oPCA is the only transformation 
which has the same metric as the pixel representation according to which the reconstruction 
error is determined. By normalizing the length of the ICA basis vectors in the pixel space, the 
metric of nICA becomes more similar to the pixel basis and the performance with respect to 
coding efficiency can be seen to improved considerably. 

In order to understand this result more precisely, we analyzed how the quantization of 
the coefficients affects the two variables of the rate-distortion function, discrete entropy and 
reconstruction error. 

Figure [6] shows an illustrative example in order to make the following analysis more intuitive. 
The example demonstrates that the quality of a transform code not only depends on the redun- 
dancy of the coefficients but also on the shape of the partition cells induced by the quantization. 
In particular, when the cells are small (i.e. the entropy rate is high), then the reconstruction 
error mainly depends on having cell shapes that minimize the average distance to the center of 
the cell. Linear transform codes can only produce partitions into parallelepipeds (Figure [6^). 
The best parallelepipeds are cubes (Figure |6]^) . This is why PCA yields the (close to) optimal 
trade-off between minimizing the redundancy and the distortion, as it is the only orthogo- 
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nal transform that yields uncorrelated coefficients. For a more comprehensive introduction to 
transform coding we refer the reader to the excellent review by Goyal |34j . 




-6 -4 -2 2 4 6 -6 -4 -2 2 4 6 



Figure 6: The Partition Cell Shape is Crucial for the Quantization Error The 

quality of a source code depends on the shapes of the partition cells and on varying the sizes 
of the cells according to the source density. When the cells are small (i.e. the entropy rate 
is high), then, the quality mainly depends on having cell shapes that minimize the average 
distance to the center of the cell. For a given volume, a body in Euclidean space that minimizes 
the average distance to the center is a sphere. The best packings (including the hexagonal case) 
cannot be achieved with linear transform codes. Transform codes can only produce partitions 
into parallelepipeds, as shown here for two dimensions. The best parallelepipeds are cubes. 
Therefore PCA yields the (close to) optimal trade-off between minimizing the redundancy and 
the distortion as it is the only orthogonal decorrelation transform (see [34 for more details). 
The figure shows 50.000 samples from a bivariate Gaussian random variable. Plot A depicts a 
uniform binning (bin width A = 0.01, only some bin borders are shown) induced by the only 
orthogonal basis for which the coefficients xi and X2 are decorrelated. Plot B shows uniform 
binning in a decorrelated, but not orthogonal basis (indicated by the blue lines). Both cases 
have been chosen such that the multi-information between the coefficients is identical and the 
same entropy rate was used to encode the signal. However, due to the shape of the bins in plot 
B the total quadratic error increases from 0.4169 to 0.9866. The code for this example can be 
also downloaded from http://www.kyb.tuebingen.mpg.de/bethge/code/QICA/ 

Discrete entropy Given a uniform binning of width 5 the discrete entropy Hg of a probability 
density p{x) is defined as 

iJa = - Pi log Pi with Pi= p{x)dx , (8) 

where Bi denotes the interval defined by the i-th bin. For small bin-sizes S ^ 0, there is a close 
relationship between discrete and differential entropy: Because of the mean value theorem we 
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can approximate pi « p{S,i)S with ^ Bi, and hence 

i 

i i 

^ V ■' ^ ' 

> — J p{x) log p{x) dx >1 

Thus, we have the relationship Hg w /i — log (5 for sufficiently small S (i.e. high-rate quanti- 
zation). In other words, Hg asymptotically grows linearly with (—log (5). Therefore, we can 
fit a linear function to the asymptotic branch of the function Hg = Hg{ — log (5) which is plot- 
ted in Figure [7]A (more precisely we are plotting the average over all dimensions). If we take 
the ordinate intercept of the linear approximation, we obtain a nonparametric estimate of the 
differential entropy which can be compared to the entropy estimates reported above (Those 
estimates were determined with the OPT estimator). Equivalently, one can consider the func- 
tion hg{— log 6) := Hg — (— log^) which gives a better visualization of the error of the linear 
approximation (Figure [t] left, dashed line). For hg{—logS) the differential entropy is obtained 
in the limit h = lim(_iog 5)^00 hg — lim^^o hg. 




-log(5) -log(5) 



Figure 7: Discrete vs. Differential Entropy A. Relationship between discrete and 
differential entropy. Discrete entropy (Hg) averaged over all channels as a function of the 
negative log-bin- width. The straight lines constitute the linear approximation to the asymptotic 
branch of the function. Their interception with the y-axis are visualized by the gray shaded, 
horizontal lines. The dashed lines represent (hg) which converge to the gray shaded lines for 
6^0. B. There are only small differences in the average discrete entropy for oPCA, wPCA, 
wICA, nICA as a function of the negative log-bin- width. Since the discrete entropy of the 
DC component is the same for all transforms, it is not included in that average but plotted 
separately instead. 
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This analysis shows that differences in differential entropy in fact translate into differences 
in discrete entropy after uniform quantization with sufficiently small bins. Accordingly, the 
minimization of the multi-information as proposed by the redundancy reduction hypothesis 
does in fact also minimize the discrete entropy of a uniformly quantized code. In particular, 
if we look at the discrete entropy of the four different transforms, oPCA, wPCA, wICA, nICA 
(Figure [7^), we find that asymptotically the two PCA transforms require slightly more entropy 
than the two ICA transforms, and there is no difference anymore between oPCA and wPCA 
or wICA and nICA. This close relationship between discrete and differential entropy for high- 
rate quantization, however, is not sufficient to determine the coding efficiency evaluated by the 
rate-distortion curve. The latter requires to compare also the reconstruction error for the given 
quantization. 

Reconstruction error The reconstruction error is defined as the mean squared distance 
in the pixel basis between the original image and the image obtained by reconstruction from 
the quantized coefficients of the considered transformation. For the reconstruction, we simply 
use the inverse of the considered transformation, which is optimal in the limit of high-rate 
quantization. 

Reconstruction-error vs. bin-width 

10 
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6 
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Figure 8: Reconstruction Error vs. Bin Width of Discrete Entropy Reconstruction 
error cr^ as a function of the bin-width 6, shown on a logarithmic scale. The differences between 
the different transforms are relatively large. Only the two transformations with exactly the same 
metric, wPCA and wICA, exhibit no difference in the reconstruction error. 

When looking at the reconstruction error as a function of the bin width (Figure |8| we can 
observe much more pronounced differences between the different transformations than it was 
the case for the entropy. As a consequence, the differences in the reconstruction error turn out 
to be much more important for coding efficiency than the differences in the entropy. Only the 
two transformations with exactly the same metric, wPCA and wICA, exhibit no difference in 
the reconstruction error. This suggests that minimization of the multi-information is strictly 
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related to efficient coding if and only if the transformation with respect to the pixel basis is 
orthogonal. As we have seen that the potential effect of higher-order redundancy reduction 
is rather small, we expect that the PCA transform constitutes a close approximation to the 
minimizer of the multi-information among all orthogonal transforms because PCA is the only 
orthogonal transform which removes all second-order correlations. 

Discussion 

The structural organization of orientation selectivity in the primary visual cortex has been 
associated with self-organization since the early seventies [35], and much progress has been 
made to narrow down the range of possible models compatible with the empirical findings [e.g. 
\EU[ I5T1 [52] . The link to visual information processing, however, still remains elusive [52 IMI US] ■ 

More abstract unsupervised learning models which obtain orientation selective filters using 
sparse coding [7] or ICA [21] try to address this link between image processing and the self- 
organization of neural structure. In particular, these models not only seek to reproduce the 
orientation tuning properties of VI simple cells but they additionally address the question of 
how the simple cell responses collectively can instantiate a representation for arbitrary images. 
Furthermore, these image representations are learned from an information theoretic principle 
assuming that the learned filters exhibit advantageous coding properties. 

The goal of this study is to quantitatively test how much evidence there is for this as- 
sumption. To this end, we investigated three criteria, the multi-information — i.e. the objective 
function of ICA — , the average log-loss, and rate-distortion curves. There are a number of pre- 
vious studies which also tried to quantify how large the advantage of the orientation selective 
ICA filters is relative to second-order decorrelation transformations. In particular, four papers 
[m [531 nil I2S|, are most closely related to this study as all of them compare the average log- 
loss of different transformations. Unfortunately, they do not provide a coherent answer to the 
question how large the advantage of ICA is compared to other decorrelation transforms. 

Lewicki and Olshausen [22] claim that their learned bases show a 15 - 20% improvement over 
traditional bases, but it is not really clear how exactly this number was obtained. According to 
reported values in their study (Table 1 in [22]) one could conclude that the advantage of ICA 
over PCA is even larger. Apart from that, a principle problem of this study is that the entire 
analysis is based on a dataset in which all images have been preprocessed with a bandpass 
filter as in [7] . Since bandpass filtering already removes a substantial fraction of second-order 
correlations in natural images, their study is likely to systematically underestimate the total 
amount of second-order correlations in natural images. Therefore, a valid comparison between 
second-order and higher-order redundancy reduction is not possible from their study. 

Lee et al [231 121] reported an advantage of over 100% percent for ICA in the case of color 
images and a more moderate but substantial gain of about 20% for gray-value images. In order 
to avoid possible differences due to the choice of data set we here used exactly the same data as 
in [23l [24]. Very consistently, we find only a small advantage for ICA of less than five percent 
for both multi-information and the average log-loss. In particular, we are not able to reproduce 
the very large difference between color and gray-value images that they reported. 

Unfortunately, it is not clear which estimation procedure was used in [23] [Mj. Therefore, we 
cannot pinpoint where the differences in the numbers ultimately come from. There seems to be 
an inconsistency though between the kurtosis estimates and the average log-loss they reported. 
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The kurtosis they find for ICA is smaller than twenty. Thus, for the exponential power family 
which fits the coefficient statistics very well, we can derive that the Negentropy has to be less 
than 0.52 bits per pixel (cf. Equations (9) and (12) in [26]). The Negentropy quantifies the 
difference between the entropy of the maximum entropy distribution of same variance minus 
the true entropy of the distribution. Since the redundancy reduction due to PCA or any other 
decorrelation transform is around 4 bits per pixel, the maximum possible advantage for ICA 
should be less than 15% even if one uses the kurtosis estimate taken from [53 121] . 

The estimators for the measures used for comparing the different linear transforms in the 
present study are well designed and have been shown to give correct results on artificial data 
[2B]. Furthermore, Weiss and Freeman showed for an undirected probabilistic image model that 
whitening already yields 98% of the total performance ^56j . Finally, the superior performance 
of the simple SSD model with only two free parameters provides a very strong explanation for 
why the gain achieved with ICA is so small relative to a random decorrelation transform: Since 
a spherically symmetric model is invariant under orthogonal transformations and provides a 
better fit to the data, the actual shape of the filter does not seem to be critical. It also shows 
that the fundamental assumption underlying ICA — the data are well described by a linear 
generative model with independent sources — is not justified in the case of natural images. 

From all these results, we can safely conclude that the actual gain of ICA compared to PCA 
is smaller than 5% for both gray level images and color images. 



Is smaller than 5% really small? A valid question to ask is whether comparing the amount 
of higher-order correlations to the amount of second-order correlations is the right thing to do. 
The reason why "bits" are the canonical units used to measure the multi-information and the 
average log-loss, originates from a coding theoretical theorem which says that the entropy is an 
(asymptotically) attainable lower bound to the average code word length (Differential entropy 
can be related in a meaningful way to discrete entropy assuming sufficiently fine quantization.) 
|35j . Hence, if the brain uses codes which are close to optimal, the multi- information measured 
in bits would correspond to the amount of resources (e.g. number of neurons or number of 
spikes) that it could save by the redundancy reduction. If the brain uses a suboptimal code, 
say a grandmother code for instance, then the number of necessary neurons would rather 
equal the number of states which grows exponentially with entropy. If we now suppose that 
Hpix > HnisiD > HjcA is the entropy of the pixels, of random whitening, and of ICA after 
quantization, respectively, then 

npix — UrND 

as reported in Table [T] implies that 



«5%. (10) 



In this sense we can argue that measuring bits (i.e. the "log-number" of states) rather than 
the number of states is a conservative way of showing that the improvement of ICA relative to 
decorrelation is small. 

We have computed complete rate distortion curves for ICA and PCA showing that ICA 
performs even worse than PCA with respect to coding efficiency because of its unfavorable 
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metric. By disentangling the rate-distortion curves into bin width — rate curves and bin width — 
distortion curves, we show that the average log-loss can only account for differences in the 
rate for a given quantization bin width. It does not take into account, however, that the 
reconstruction error of two transforms can be quite different for a fixed bin width if they do not 
have the same metric, i.e. if they differ by more than just an orthogonal transformation. This 
is due to the fact that for sufficiently small bin widths the quality of the encoding depends on 
having cell shapes that minimize the average distance to the center of the cell [31]. Therefore, 
non-orthogonal decorrelation transforms suffer from disadvantageous cell shapes (Figure [6|. 
This explains why ICA performs even worse than PCA with respect to coding efficiency in 
terms of mean squared reconstruction error. Hence, the advantage of ICA in redundancy 
reduction does not pay off in terms of reconstruction error. Finally, also perceptually image 
patches sampled from the ICA model do not look more similar to natural image patches than 
those sampled from the random decorrelation basis (Figure l9l . 



RND RAW ICA 




Figure 9: Comparison of Patches Sampled Prom Different Image Models The figure 
demonstrates that the visual appearance of samples from the ICA image model (right) do not 
look significantly more similar to samples from natural images (middle) than samples from the 
image model (left) with random (pink noise) basis functions. 



In summary, we were not able thus far to come up with a meaningful interpretation for which 
the improvement of ICA would be recognized as being large. If such an interpretation exists it 
would very desirable to know as it might point to a possible refinement of the interpretation 
of the redundancy reduction hypothesis. On the basis of the present study, however, it seems 
rather unlikely that such a measure can be found. Instead, we think that our study provides 
solid evidence that orientation selective filters in a linear representation do not yield a clear 
advantage in terms of redundancy reduction and are rather suboptimal in terms of coding 
efficiency. 



What about nonparametric approaches? The focus on linear redundancy reduction mod- 
els in this study is motivated by the goal to first establish a solid and reproducible result for the 
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simplest possible case before moving on to more involved nonlinear transformations. Neverthe- 
less, it is important to discuss what we can expect if the restriction to linear transformations 
is dropped. From a nonparametric analysis |25j . Petrov and Zhaoping concluded that higher- 
order correlations in general contribute only very little to the redundancy in natural images and, 
hence, are probably not the main cause for the receptive field properties in VI. The empirical 
support for this claim, however, is not very strong as their comparison is based on mutual in- 
formation estimates within a very small neighborhood of five pixels only. This is problematic as 
it is known that many kinds of higher-order correlations in natural images manifest themselves 
only in much higher-dimensional statistics |57j . Furthermore, their estimate of the amount of 
second-order correlations is not invariant against point-wise nonlinear transformations of the 
pixel intensities. 

In a more recent non-parametric study. Chandler and Field arrived at a very different 
conclusion than Petrov and Zhaoping. They use nearest-neighbor based methods to estimate 
the joint entropy of natural images in comparison to pink noise and white noise |58j where pink 
noise denotes Gaussian noise with the same spectrum as that of natural images. As a result 
(Fig. 18) they find a smaller difference between pink noise and white noise as between natural 
images and pink noise. Thus, from their finding, it seems that the amount of higher-order 
correlations in natural images is even larger than the amount of second-order correlations. Also 
this result has to be taken with care: Apart from the fact that reliable non-parametric estimates 
in high-dimensions are difficult to obtain even if one resorts to nearest-neighbor based methods, 
the estimate of the amount of second-order correlations in [SS] is not invariant against pointwise 
nonlinear transformations of the pixel intensities, too. 

In summary, the results of the present nonparametric studies do not yield a clear picture 
how large the total amount of higher-order correlations in natural images really is. In fact, 
estimating the absolute amount of multi-information is an extremely difficult task in high 
dimensions. Therefore, the differences in the results can easily originate from the different 
assumptions and approximations made in these studies. 

What about nonlinear image models? Apart from the non-parametric approaches, a 
large number of nonlinear image models has been proposed over the years which are capable 
to capture significantly more statistical regularities of natural images than linear ICA can do 
(e.g. [SgEOllSelETllMllMllMllMl ES]). in fact, Olshausen and Field [7] already used a more 
general model than linear ICA when they originally derived the orientation selective filters 
from higher-order redundancy reduction. In contrast to plain ICA, they used an overcomplete 
generative model which assumes more source signals than pixel dimensions. In addition, the 
sources are modeled as latent variables like in a factor analysis model. That is the data is 
assumed to be generated according to x = -I- ^ where A denotes the overcomplete dictionary, 
s is distributed according to a sparse factorial distribution, and ^ is a Gaussian random variable. 
The early quantitative study by Lewicki and Olshausen [22J could not demonstrate an advantage 
of overcomplete coding in terms of coding efficiency and also the more recent work by Seeger 
seems to confirm this conclusion. The addition of a Gaussian random variable ^ to As, 
however is likely to be advantageous as it may help to interpolate betweem the plain ICA model 
on the one hand and the spherically symmetric model on the other hand. A comparison of the 
average log-loss between this model and plain ICA has not been done yet but we can expect 
that this model can achieve a similar or even better match to the natural image statistics as 
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the spherically symmetric model. 

The spherical symmetric model can also be modeled by a redundancy reduction transforma- 
tion which changes the radial component such that the output distribution is sought to match 
a Gaussian distribution |46j . Hence, the redundancy reduction of this model is very similar to 
the average log-loss of the spherically symmetric distribution. From a biological vision point of 
view, this type of model is particularly interesting as it allows one to draw a link to divisive 
normalization, a prominent contrast gain control mechanism observed for virtually all neurons 
in the early visual system. Our own ongoing work |67j shows that this idea can be generalized 
to a larger class of Lp-spherically symmetric distributions [61] . In this way, it is possible to find 
an optimal interpolation between ICA and the spherically symmetric case |68j . That is, one can 
combine orientation selectivity with divisive normalization in a joint model. Our preliminary 
results suggests that optimal divisive normalization together with orientation selectivity allows 
for about 10% improvement while divisive normalization alone (i.e. the spherical symmetric 
model) is only 2% worse [67] . 

Note that, in general, orientation selectivity can only play a critical role if one assumes 
some form of regularization on the function space of possible transformations. Otherwise, the 
optimization problem would be highly degenerated [69] . Thus, even if we drop the constraint of 
linearity, we probably want to stay somewhat close to a linear function space. Typical examples 
would be the generalized linear model (i.e. adding pointwise nonlinearities) which does not have 
an effect on the multi-information at all, or — more interestingly — a scaling nonlinearity such as 
divisive normalization which we discussed above. 

Concluding remarks Taken together, the current stage of statistical image modeling and 
quantitative model comparison suggests that it is rather unlikely that orientation selectivity 
plays a critical role for redundancy reduction even if the class of transformations is not limited 
to linear ones. This finding urges us to rethink how we can interpret the function of orientation 
selectivity in primary visual cortex. For one, we should further explore and quantify how 
nonlinearities and other known properties of VI contribute to redundancy reduction. On the 
other hand, we should also try to extend the quantitative approach to include other goals than 
redundancy reduction only. While there are certainly many ideas about possible functions of 
orientation selective filtering, the crucial challenge is to turn those ideas into a quantitatively 
testable framework. For instance, many algorithms for edge detection, image segmentation, and 
texture classification use orientation selective filtering as a preprocessing step. Most frequently, 
however, it is not clear how critical this choice of image basis is for solving these tasks. 

In order to achieve a more principled take on the potential use of orientation selectivity 
for early vision we are currently trying to extend the efficient coding framework to deal with 
other loss functions than mean squared error on the image pixels. Obviously, the goal of the 
visual system is not to preserve the representation of the visual input. Instead, seeing means to 
make successful predictions about behaviorally relevant aspects of the environment [70] . Since 
3D shape inference is necessary to almost any naturally relevant task, it seems particularly 
interesting to explore the role of orientation selectivity in the context of 3D shape inference 
[71 j . For a quantitative account of this problem one can seek to minimize the reconstruction 
error for the 3D shape rather than for its 2D image. Certainly, this task is much more involved 
than image reconstruction and there is still a long way to go. Nevertheless, we believe that 
going in this direction is a worthwhile enterprise which eventually may help us to unravel the 
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principles of neural processing in the brain that are ultimately responsible for our ability to see. 
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